Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support bundle checkpoint / preemptible workers #3882

Merged
merged 78 commits into from
Apr 14, 2022
Merged

Support bundle checkpoint / preemptible workers #3882

merged 78 commits into from
Apr 14, 2022

Conversation

epicfaace
Copy link
Member

@epicfaace epicfaace commented Nov 10, 2021

For documentation on how to use this feature, see: https://github.com/codalab/codalab-worksheets/blob/6423cc3ba1e6d9779224c6c66e97b59a9c208197/docs/Checkpoints.md

Original design doc: https://docs.google.com/document/d/1ifcuXLPldSWJehZgl9RTmPogllbyUJ_vRIOiW2BF1i0/edit#

Checklist:

  • Add --preemptible flag to cl-worker and --worker-preemptible flag to cl-worker-manager
  • Update bundle manager logic so that: - Ashwin Ramaswami
    • Worker checkins include the “preemptible” field
    • A bundle in worker_offline that is assigned to a preemptible worker will be moved to staged
  • Add tests in test_cli.py to ensure this logic works

codalab/model/bundle_model.py Outdated Show resolved Hide resolved
codalab/model/bundle_model.py Show resolved Hide resolved
@epicfaace epicfaace marked this pull request as ready for review December 2, 2021 16:05
@epicfaace epicfaace changed the title Bundle checkpoint / preemptible changes Support bundle checkpoint / preemptible workers Dec 2, 2021
@@ -1000,15 +1010,20 @@ def transition_bundle_worker_offline(self, bundle):
# The user deleted the bundle or the bundle finished
return False

if getattr(bundle.metadata, "preemptible", False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we access the preemptible field directly and have to use getattr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that if we do bundle.metadata["preemptible"], this will break bundles that started running in the previous deploy (but continue running during the deploy) that don't yet have this metadata key yet.

docs/Checkpoints.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed it, but I didn't see any logic on the worker side on how bundles are resumed. How will the resumed bundle be run in the same working directory as its previous incarnation?

I do think that the bundle needs a preemptible flag, because in some cases, you don't want your job to be resumed from a partially completed working directory because you'll get invalid results. In those cases, the bundle should not be restaged. Otherwise, I'm afraid that we're going to get into some bad infinite loops.

@@ -1000,15 +1010,20 @@ def transition_bundle_worker_offline(self, bundle):
# The user deleted the bundle or the bundle finished
return False

if getattr(bundle.metadata, "preemptible", False):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that if we do bundle.metadata["preemptible"], this will break bundles that started running in the previous deploy (but continue running during the deploy) that don't yet have this metadata key yet.

codalab/model/bundle_model.py Show resolved Hide resolved
@epicfaace
Copy link
Member Author

@percyliang @teetone One more thing -- with this PR, the bundle stats (such as time taken) only include the time for the last worker that ran a bundle (rather than the total time across all workers that ran a bundle). We might want to update that in the future, though, as that's just nice-to-have functionality that could take longer to implement.

@epicfaace epicfaace requested a review from wwwjn April 12, 2022 20:36
@epicfaace
Copy link
Member Author

@teetone @wwwjn tests pass now, this is ready for review.

Copy link
Collaborator

@teetone teetone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went over the PR offline with Ashwin

@mergify mergify bot merged commit e89d334 into master Apr 14, 2022
@mergify mergify bot deleted the checkpoint branch April 14, 2022 16:17
@epicfaace epicfaace mentioned this pull request Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bundle manager changes required for model checkpoints
3 participants